{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lesson 09 - Regular Expressions\n",
    "\n",
    "### Readings\n",
    "\n",
    "* Kuchling: [Regular Expression HOWTO (for Python)](https://docs.python.org/3/howto/regex.html)\n",
    "* [Linux Grep Tutorial](http://ryanstutorials.net/linuxtutorial/grep.php)\n",
    "* [Perl 5 Regex Cheat Sheet](https://perlmaven.com/regex-cheat-sheet)\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "* [Introduction](#intro)\n",
    "* [Examples with Text Editor](#tutorial)\n",
    "* [Basic Syntax](#syntax)\n",
    "* [sed](#sed)\n",
    "* [grep](#grep)\n",
    "* [Python](#re)\n",
    "* [Regex Cheatsheet](#cheatsheet)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"intro\"></a>\n",
    "\n",
    "### Introduction: What are regular expresions?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A *regular expression* is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions, also called \"regex\", are widely used in the UNIX world. We build regular expressions to match exactly the possible string permutations we want *and nothing more*.\n",
    "\n",
    "Here are some examples of regular expressions:\n",
    "\n",
    "`.ob` -- matches `Bob`, `sob`, `8ob`, `!ob`, ....<br>\n",
    "`Lo*l` -- matches `Ll`, `Lol`, `Looooool`, ....<br>\n",
    "`Lo?l` -- matches `Ll` and `Lol`.<br>\n",
    "`A.*e` -- matches `Ae`, `Abe`, `Arrrrrre`, ....<br> \n",
    "`[BFG]ad` -- matches `Bad`,  `Fad`, and `Gad`.<br>\n",
    "`[0-9]*\\.[0-9]*` -- matches `3.14159`, `1000.0`, ....<br>\n"
   ]
  },
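  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to check the examples above is to test each pattern against its sample strings. The cell below is a minimal sketch that assumes Python's `re` module (covered later in this lesson) and uses `re.fullmatch`, which succeeds only when the entire string matches the pattern. The name `examples` is just an illustrative choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Each pattern from the list above, paired with strings it should match\n",
    "examples = {\n",
    "    r'.ob': ['Bob', 'sob', '8ob', '!ob'],\n",
    "    r'Lo*l': ['Ll', 'Lol', 'Looooool'],\n",
    "    r'Lo?l': ['Ll', 'Lol'],\n",
    "    r'A.*e': ['Ae', 'Abe', 'Arrrrrre'],\n",
    "    r'[BFG]ad': ['Bad', 'Fad', 'Gad'],\n",
    "    r'[0-9]*\\.[0-9]*': ['3.14159', '1000.0'],\n",
    "}\n",
    "\n",
    "for pattern, strings in examples.items():\n",
    "    for s in strings:\n",
    "        # fullmatch returns None unless the whole string matches\n",
    "        assert re.fullmatch(pattern, s), (pattern, s)\n",
    "\n",
    "# 'Loool' has too many o's for Lo?l, so there is no match\n",
    "print(re.fullmatch(r'Lo?l', 'Loool'))  # None"
   ]
  },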
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"tutorial\"></a>\n",
    "\n",
    "### Examples with Text Editor\n",
    "\n",
    "Your favorite GUI text editor (Atom, Sublime Text, etc.) supports regular expressions with live search! \n",
    "\n",
    "1. Open `woodchuck_pandas.csv` (lessons folder) in Atom or Sublime Text and try searching for regular expressions.\n",
    "2. Find and replace text using regular expressions.\n",
    "3. Replace matching substrings using parentheses to group expressions `(regex1) (regex2)` and recall them with numerical variables `$1`, `$2`, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"syntax\"></a>\n",
    "\n",
    "### Regular Expression Syntax"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Basic character classes\n",
    "\n",
    "`a` -- A single character (here 'a').<br>\n",
    "`.` -- Any character except newline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Character classes\n",
    "\n",
    "`[bgh.]` -- One of the characters listed in the character class b,g,h or . in this case.<br>\n",
    "`[b-h]` -- Same as [bcdefgh].<br>\n",
    "`[a-z]` -- Lower case Latin letters.<br>\n",
    "`[bc-]` -- The characters b, c or - (dash).<br>\n",
    "`[^bx]` -- Negated character class. Anything except b or x."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predefined character class abbreviations\n",
    "\n",
    "Construct        | Equivalent class | Negated construct   | Equivalent negated class\n",
    "-----------------|------------------|---------------------|-------------------------\n",
    "`\\d` (a digit)   | `[0-9]`          | `\\D` (digits, not!) | `[^0-9]`\n",
    "`\\w` (word char) | `[a-zA-Z0-9_]`   | `\\W` (words, not!)  | `[^a-zA-Z0-9_]`\n",
    "`\\s` (space char)| `[ \\r\\t\\n\\f]`    | `\\S` (space, not!)  | `[^ \\r\\t\\n\\f]`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multipliers\n",
    "\n",
    "Multiplier symbol | Bracket notation | Number of instances matched\n",
    ":----------------:|:----------------:|:-----------------:\n",
    "`*`               | `{0,}`           | 0, 1, 2, ...     \n",
    "`+`               | `{1,}`           | 1, 2, 3, ...      \n",
    "`?`               | `{0,1}`          | 0 or 1            \n",
    "                  | `{2}`            | 2                 \n",
    "                  | `{2,}`           | 2, 3, 4, ...      \n",
    "                  | `{1,3}`          | 1, 2 or 3        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Alternation\n",
    "\n",
    "`a|b` -- Match first part or second part (here 'a' or 'b')."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Grouping and capturing\n",
    "\n",
    "`(...)` -- Grouping and capture matching substring.<br>\n",
    "`\\1`, `\\2` -- Use substring in replacement (sed).<br>\n",
    "`$1`, `$2` -- Use substring in replacement (perl, Atom, Sublime Text)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Anchors\n",
    "\n",
    "`^` -- Beginning of string.<br>\n",
    "`$` -- End of string."
   ]
  },
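  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The syntax elements above can be tried out directly. The cell below is a small sketch using Python's `re` module (introduced later in this lesson); the sample strings and variable names are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Character class: inside brackets the dot is a literal character\n",
    "in_class = re.findall(r'[bgh.]', 'big.hat')\n",
    "print(in_class)   # ['b', 'g', '.', 'h']\n",
    "\n",
    "# Predefined class with a multiplier: one or more digits\n",
    "digits = re.findall(r'\\d+', 'a1b22c333')\n",
    "print(digits)     # ['1', '22', '333']\n",
    "\n",
    "# Bracket multiplier: two or three consecutive o's\n",
    "runs = re.findall(r'o{2,3}', 'o oo ooo oooo')\n",
    "print(runs)       # ['oo', 'ooo', 'ooo']\n",
    "\n",
    "# Alternation: either alternative may match\n",
    "animals = re.findall(r'cat|dog', 'cat, dog, cow')\n",
    "print(animals)    # ['cat', 'dog']\n",
    "\n",
    "# Grouping and capturing: swap two words using \\1 and \\2\n",
    "swapped = re.sub(r'(\\w+) (\\w+)', r'\\2 \\1', 'hello world')\n",
    "print(swapped)    # world hello\n",
    "\n",
    "# Anchors: the whole string must be exactly 'Aa'\n",
    "exact = bool(re.search(r'^Aa$', 'Aa'))\n",
    "longer = bool(re.search(r'^Aa$', 'AaBb'))\n",
    "print(exact, longer)  # True False"
   ]
  },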
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sed\"></a>\n",
    "\n",
    "### sed\n",
    "\n",
    "`sed` stands for 'stream editor'. It accepts standard input, modifies it, and prints to standard output.\n",
    "\n",
    "The standard syntax for `sed` is `s/find/replace/g`, where `g` (optional) means make the change everywhere the string is found."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "echo \"AaBbBbCcCcCcDdDdDdDd\n",
    "Aa\n",
    "BbBb\n",
    "CcCcCc\n",
    "DdDdDdDd\" > test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\r\n",
      "Aa\r\n",
      "BbBb\r\n",
      "CcCcCc\r\n",
      "DdDdDdDd\r\n"
     ]
    }
   ],
   "source": [
    "cat test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbDdDdDdDd\n",
      "Aa\n",
      "BbBb\n",
      "\n",
      "DdDdDdDd\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed 's/Cc//g' test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcZzZzZzZz\n",
      "Aa\n",
      "BbBb\n",
      "CcCcCc\n",
      "ZzZzZzZz\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed 's/Dd/Zz/g' test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\n",
      "Aa\n",
      "BbBb\n",
      "CcCcCc\n",
      "DdDdDdDd\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed 's/.*(Dd)*/XXX/g' test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "XXX\n",
      "XXX\n",
      "XXX\n",
      "XXX\n",
      "XXX\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed 's/.*\\(Dd\\)*/XXX/g' test_regex.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `sed -E` instead of `sed` will give you access to the advanced Perl regular expression syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "XXX\n",
      "XXX\n",
      "XXX\n",
      "XXX\n",
      "XXX\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed -E 's/.*(Dd)*/XXX/g' test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "XXX\n",
      "Aa\n",
      "BbBb\n",
      "CcCcCc\n",
      "XXX\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "sed -E 's/.*(Dd)+/XXX/g' test_regex.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"grep\"></a>\n",
    "\n",
    "### grep\n",
    "\n",
    "`grep` comes from the phrase \"Global search for Regular Expression and Print matching lines\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `egrep` or `grep -E` instead of `grep` will give you access to the advanced Perl regular expression syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\n",
      "Aa\n",
      "BbBb\n",
      "CcCcCc\n",
      "DdDdDdDd\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cat test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\n",
      "Aa\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "egrep \"Aa\" test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CcCcCc\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "egrep \"^Cc\" test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\n",
      "CcCcCc\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "egrep \"(Cc){3}\" test_regex.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AaBbBbCcCcCcDdDdDdDd\n",
      "CcCcCc\n",
      "DdDdDdDd\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "egrep \"([A-Z][a-z]){3}\" test_regex.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"re\"></a>\n",
    "\n",
    "### Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adapted from [TutorialsPoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm).\n",
    "\n",
    "The module `re` provides full support for Perl-like regular expressions in Python. The `re` module raises the exception `re.error` if an error occurs while compiling or using a regular expression.\n",
    "\n",
    "We use three important functions -- `match`, `search` and `replace` -- to handle regular expressions. But a small thing first: There are various characters that would have special meaning if they were used in regular expressions. To avoid any confusion while dealing with regular expressions, we use raw strings as in `r'expression'`.\n",
    "\n",
    "To use regular expressions in Python, use the `re` module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### The `search` function\n",
    "\n",
    "This function searches for first occurrence of RE *pattern* within *string* with optional *flags*.\n",
    "\n",
    "Here is the syntax for this function:\n",
    "\n",
    "\tre.search(pattern, string, flags=0)\n",
    "    \n",
    "Here is the description of the parameters:\n",
    "\n",
    "| Parameter | Description |\n",
    "|:----------|:------------|\n",
    "| pattern   | This is the regular expression to be matched.|\n",
    "| string    | This is the string, which would be searched to match the pattern at the beginning of string.|\n",
    "| flags     | Several modifiers are available, but the most common one is `re.I` for case-insensitive search.|\n",
    "\n",
    "Both `re.search` and `re.match` function returns a match object on success, None on failure. We use `group(num)` or `groups()` function of match object to get matched expression:\n",
    "\n",
    "| Match Object Method | Description |\n",
    "|:--------------------|:------------|\n",
    "| group(num=0)        | This method returns entire match (or specific subgroup num) |\n",
    "| groups()            | This method returns all matching subgroups in a tuple (empty if there weren't any) |\n",
    "\n",
    "Note: There is a similar function `match` that will match only if the string begins with the regular expression. This function is unnecessary; we can simply use `search` and begin our regular expression with a `^`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example 1\n",
    "\n",
    "x = 'Begin With Review And Friend'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(6, 10), match='With'>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'W[a-z]*', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'With'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'W[a-z]*', x).group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = re.search(r'W[a-z]*', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'With'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "re.match(r'W[a-z]*', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(0, 5), match='Begin'>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.match(r'Begin', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(0, 5), match='Begin'>"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(r'^Begin', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "re.search(r'^With', x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example 2\n",
    "\n",
    "line = \"3.8 Liters in 1 Gallon\"\n",
    "\n",
    "searchObj = re.search(r'([.0-9]*) liters in ([.0-9]*) gallon', line, re.I)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('3.8', '1')"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "searchObj.groups()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'3.8 Liters in 1 Gallon'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "searchObj.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "searchObj.group() :  3.8 Liters in 1 Gallon\n",
      "searchObj.group(1) :  3.8\n",
      "searchObj.group(2) :  1\n"
     ]
    }
   ],
   "source": [
    "if searchObj:\n",
    "   print(\"searchObj.group() : \", searchObj.group())\n",
    "   print(\"searchObj.group(1) : \", searchObj.group(1))\n",
    "   print(\"searchObj.group(2) : \", searchObj.group(2))\n",
    "else:\n",
    "   print(\"Nothing found!!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The sub Function\n",
    "\n",
    "One of the most important re methods that use regular expressions is `sub`, which allows you to search and replace.\n",
    "\n",
    "Syntax:\n",
    "\n",
    "\tre.sub(pattern, repl, string, max=0)\n",
    "\n",
    "This method replaces all occurrences of the RE *pattern* in *string* with *repl*, substituting all occurrences unless *max* provided. This method returns modified string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example 3\n",
    "\n",
    "phone = \"858-959-5559 # This is Phone Number\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'858-959-5559 '"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# delete Python-style comments\n",
    "num = re.sub(r'#.*$', \"\", phone)\n",
    "num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'858-959-5559'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'8589595559'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# remove anything other than digits\n",
    "num = re.sub(r'\\D', \"\", num)    \n",
    "num"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep and use matching substrings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'L001_R1'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'sample_name_L001_R1_001.fastq'\n",
    "re.sub(r'.*_(L00[12])_(R[12])_.*.fastq', r'\\1_\\2', filename)"
   ]
  },
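  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `re.sub` examples above replace every match. To limit the number of replacements, `re.sub` also accepts a `count` keyword argument. The cell below is a short sketch of the difference; the sample string is made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'CcCcCc'\n",
    "\n",
    "# Default (count=0): every non-overlapping match is replaced\n",
    "replaced_all = re.sub(r'Cc', 'Zz', text)\n",
    "print(replaced_all)    # ZzZzZz\n",
    "\n",
    "# count=1: only the first match is replaced\n",
    "replaced_first = re.sub(r'Cc', 'Zz', text, count=1)\n",
    "print(replaced_first)  # ZzCcCc"
   ]
  },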
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The compile Function\n",
    "\n",
    "Regular expressions can be compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'ab*', re.UNICODE)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Example 4\n",
    "\n",
    "p = re.compile('ab*')\n",
    "p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(2, 4), match='ab'>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = p.search('Drab')\n",
    "m"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ab'"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2, 4)"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.span()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.start()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.end()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example 5\n",
    "\n",
    "q = re.compile('[a-z]+')\n",
    "n = q.match('tempo')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'[a-z]+', re.UNICODE)"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(0, 5), match='tempo'>"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'tempo'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 5)"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n.start(), n.end()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The findall Function\n",
    "\n",
    "If you want to find all examples of a regular expression in a string, `findall` will do this for you and return the results as a list. The following example will find all the lowercase letters in a string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['e', 'r', 'e', 's', 't', 'r', 'i', 'n', 'g']"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Example 6\n",
    "\n",
    "regex = re.compile('[a-z]')\n",
    "string = 'Here Is A String'\n",
    "letters = regex.findall(string)\n",
    "letters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Here', 'is', 'a', 'string']"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex = re.compile(r'[A-Za-z][a-z]*')\n",
    "string = 'Here is a string.'\n",
    "words = regex.findall(string)\n",
    "words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aside on joining lists..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Here is a string'"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['23', '45', '23', '22']"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numbers = [23, 45, 23, 22]\n",
    "numbers = [str(x) for x in numbers]\n",
    "numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'23,45,23,22'"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "','.join(numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Built-in string methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string.find('string')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string.startswith('Here')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string.endswith('string.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string.startswith('is', 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"cheatsheet\"></a>\n",
    "\n",
    "### Appendix: Regular Expression Cheatsheet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Regular Expression Patterns\n",
    "\n",
    "Except for control characters, `(+ ? . * ^ $ ( ) [ ] { } | \\)`, all characters match themselves. You can escape a control character by preceding it with a backslash.\n",
    "\n",
    "Following table lists the regular expression syntax that is available in Python:\n",
    "\n",
    "| Pattern | Description\n",
    "|:----------|:------------|\n",
    "| ^ | Matches beginning of line.\n",
    "| `$` | Matches end of line.\n",
    "| . | Matches any single character except newline. Using m option allows it to match newline as well.\n",
    "| [...] | Matches any single character in brackets.\n",
    "| [^...] | Matches any single character not in brackets\n",
    "| re* | Matches 0 or more occurrences of preceding expression.\n",
    "| re+ | Matches 1 or more occurrence of preceding expression.\n",
    "| re? | Matches 0 or 1 occurrence of preceding expression.\n",
    "| re{n} | Matches exactly n number of occurrences of preceding expression.\n",
    "| re{n,} | Matches n or more occurrences of preceding expression.\n",
    "| re{n, m} | Matches at least n and at most m occurrences of preceding expression.\n",
    "| `a` &#124; `b` | Matches either a or b.\n",
    "| (re) | Groups regular expressions and remembers matched text.\n",
    "| \\w | Matches word characters.\n",
    "| \\W | Matches nonword characters.\n",
    "| \\s | Matches whitespace. Equivalent to [\\t\\n\\r\\f].\n",
    "| \\S | Matches nonwhitespace.\n",
    "| \\d | Matches digits. Equivalent to [0-9].\n",
    "| \\D | Matches nondigits.\n",
    "| \\A | Matches beginning of string.\n",
    "| \\Z | Matches end of string. If a newline exists, it matches just before newline.\n",
    "| \\z | Matches end of string.\n",
    "| \\G | Matches point where last match finished.\n",
    "| \\b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.\n",
    "| \\B | Matches nonword boundaries.\n",
    "| \\n, \\t, etc. | Matches newlines, carriage returns, tabs, etc.\n",
    "| \\1...\\9 | Matches nth grouped subexpression.\n",
    "| \\10 | Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Regular Expression Examples\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| python | Match \"python\".\n",
    "| [Pp]ython | Match \"Python\" or \"python\"\n",
    "| rub[ye] | Match \"ruby\" or \"rube\"\n",
    "| [aeiou] | Match any one lowercase vowel\n",
    "| [0-9] | Match any digit; same as [0123456789]\n",
    "| [a-z] | Match any lowercase ASCII letter\n",
    "| [A-Z] | Match any uppercase ASCII letter\n",
    "| [a-zA-Z0-9] | Match any of the above\n",
    "| [^aeiou] | Match anything other than a lowercase vowel\n",
    "| [^0-9] | Match anything other than a digit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Special Character Classes\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| . | Match any character except newline\n",
    "| \\d | Match a digit: [0-9]\n",
    "| \\D | Match a nondigit: [^0-9]\n",
    "| \\s | Match a whitespace character: [ \\t\\r\\n\\f]\n",
    "| \\S | Match nonwhitespace: [^ \\t\\r\\n\\f]\n",
    "| \\w | Match a single word character: [A-Za-z0-9_]\n",
    "| \\W | Match a nonword character: [^A-Za-z0-9_]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Repetition Cases\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| ruby? | Match \"rub\" or \"ruby\": the y is optional\n",
    "| ruby* | Match \"rub\" plus 0 or more ys\n",
    "| ruby+ | Match \"rub\" plus 1 or more ys\n",
    "| \\d{3} | Match exactly 3 digits\n",
    "| \\d{3,} | Match 3 or more digits\n",
    "| \\d{3,5} | Match 3, 4, or 5 digits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nongreedy repetition\n",
    "\n",
    "This matches the smallest number of repetitions.\n",
    "\n",
    "| Example | Description|\n",
    "|:------|:-----------|\n",
    "| <.\\*> | Greedy repetition: matches \"<python>perl>\"|\n",
    "| <.*?> | Nongreedy: matches \"<python>\" in \"<python>perl>\"|"
   ]
  },
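  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two table rows above can be reproduced in Python. The cell below is a minimal sketch; the variable names are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "s = '<python>perl>'\n",
    "\n",
    "# Greedy: .* grabs as much as possible, up to the last '>'\n",
    "greedy = re.search(r'<.*>', s).group()\n",
    "print(greedy)     # <python>perl>\n",
    "\n",
    "# Nongreedy: .*? stops at the first '>' that completes a match\n",
    "nongreedy = re.search(r'<.*?>', s).group()\n",
    "print(nongreedy)  # <python>"
   ]
  },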
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Grouping with Parentheses\n",
    "\n",
    "| Example | Description|\n",
    "|:----------|:------------|\n",
    "| \\D\\d+ | No group: + repeats \\d\n",
    "| (\\D\\d)+ | Grouped: + repeats \\D\\d pair\n",
    "| ([Pp]ython(, )?)+ | Match \"Python\", \"Python, python, python\", etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Backreferences\n",
    "\n",
    "This matches a previously matched group again:\n",
    "\n",
    "| Example | Description |\n",
    "|:----------|:------------|\n",
    "|([Pp])ython&\\1ails | Match python&pails or Python&Pails |\n",
    "|(['\"])[^\\1]*\\1 | Single or double-quoted string. \\1 matches whatever the 1st group matched. \\2 matches whatever the 2nd group matched, etc. |"
   ]
  },
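  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Backreferences can be checked in Python as well. The cell below is a sketch under the assumption that a nongreedy `.*?` between the quotes is good enough for simple quoted strings (it does not handle escaped quotes); the names `pattern`, `quoted`, and `text` are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# \\1 must repeat exactly what the first group captured\n",
    "pattern = re.compile(r'([Pp])ython&\\1ails')\n",
    "same_upper = bool(pattern.search('Python&Pails'))\n",
    "same_lower = bool(pattern.search('python&pails'))\n",
    "mixed = bool(pattern.search('Python&pails'))\n",
    "print(same_upper, same_lower, mixed)  # True True False\n",
    "\n",
    "# Quoted strings: the closing quote must be the same character as the opening one\n",
    "quoted = re.compile(r'''(['\"]).*?\\1''')\n",
    "text = '''say \"hi\" or 'bye' now'''\n",
    "found = [m.group() for m in quoted.finditer(text)]\n",
    "print(found)"
   ]
  },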
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Alternatives\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| python&#124;perl | Match \"python\" or \"perl\"\n",
    "| rub(y&#124;le)) | Match \"ruby\" or \"ruble\"\n",
    "| Python(!+&#124;\\?) | \"Python\" followed by one or more ! or one ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Anchors\n",
    "\n",
    "This needs to specify match position.\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| ^Python | Match \"Python\" at the start of a string or internal line\n",
    "| Python$ | Match \"Python\" at the end of a string or line\n",
    "| \\APython | Match \"Python\" at the start of a string\n",
    "| Python\\Z | Match \"Python\" at the end of a string\n",
    "| \\bPython\\b | Match \"Python\" at a word boundary\n",
    "| \\brub\\B | \\B is nonword boundary: match \"rub\" in \"rube\" and \"ruby\" but not alone\n",
    "| Python(?=!) | Match \"Python\", if followed by an exclamation point.\n",
    "| Python(?!!) | Match \"Python\", if not followed by an exclamation point."
   ]
  },
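  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lookahead and word-boundary rows above are easier to follow with match positions. The cell below is a small sketch using `span()` to show where each match lands; the sample strings and variable names are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'Python! Python?'\n",
    "\n",
    "# (?=!) lookahead: match 'Python' only when '!' follows (the '!' is not consumed)\n",
    "followed = re.search(r'Python(?=!)', text)\n",
    "print(followed.span())      # (0, 6)\n",
    "\n",
    "# (?!!) negative lookahead: match 'Python' only when '!' does not follow\n",
    "not_followed = re.search(r'Python(?!!)', text)\n",
    "print(not_followed.span())  # (8, 14)\n",
    "\n",
    "# \\brub\\B: 'rub' at the start of a word but not at its end\n",
    "partial = re.findall(r'\\brub\\B', 'rub ruby rube')\n",
    "print(partial)              # ['rub', 'rub']"
   ]
  },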
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Special Syntax with Parentheses\n",
    "\n",
    "| Example | Description\n",
    "|:----------|:------------|\n",
    "| R(?#comment) | Matches \"R\". All the rest is a comment\n",
    "| R(?i)uby | Case-insensitive while matching \"uby\"\n",
    "| R(?i:uby) | Same as above\n",
    "| rub(?:y&#124;le)) | Group only without creating \\1 backreference"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
